Abstract

A variety of different network technologies and topologies are currently being
evaluated as part of the Whitney Project. This paper reports on the implementation
and performance of a Fast Ethernet network configured in a 4x4 2D torus topology
in a testbed cluster of “commodity” Pentium Pro PCs. Several benchmarks were
used for performance evaluation: an MPI point to point message passing
benchmark, an MPI collective communication benchmark, and the NAS Parallel
Benchmarks version 2.2 (NPB2). Our results show that for point to point
communication on an unloaded network, the hub and 1 hop routes on the torus have
about the same bandwidth and latency. However, the bandwidth decreases and the
latency increases on the torus for each additional route hop. Collective
communication benchmarks show that the torus provides roughly four times more
aggregate bandwidth and eight times faster MPI barrier synchronizations than a
hub based network for 16 processor systems. Finally, the NPB2 benchmarks, which
simulate real-world CFD applications, generally demonstrated substantially better
performance on the torus than on the hub. In the few cases the hub was faster, the
difference was negligible. In total, our experimental results lead to the
conclusion that for Fast Ethernet networks, the torus topology has better
performance and scales better than a hub based network.


1. Work performed under NASA Contract NAS 2-14303 




1.0 Introduction 

Recent advances in “commodity” computer technology have brought the 
performance of personal computers close to that of workstations. In addition, 
advances in “off-the-shelf” networking technology have made it possible to
design a parallel system made purely of commodity components, at a fraction of 
the cost of MPP or workstation components. The Whitney project, being 
performed at NASA Ames Research Center, attempts to integrate these 
components in order to provide a cost effective parallel testbed. 

One of the key components of Whitney is the means of interconnecting the 
processors. There are many custom, semi-custom, and commodity technologies 
available for networking. These include Ethernet, Fast Ethernet, Gigabit 
Ethernet, Myrinet, HiPPI, FDDI, SCI, etc. The most attractive of these choices, 
however, is currently Fast Ethernet, due to its good performance and extremely 
low cost. 

Combining a large number of systems into a high performance parallel computer 
requires the careful selection of both network technology and topology. The 
Whitney project is currently evaluating different network technologies and 
topologies in a testbed cluster of “commodity” Intel Pentium Pro PCs. This 
paper will report on the implementation and performance of Fast Ethernet, both 
in a single hub and in a 4x4 routed 2D torus topology (see footnote 2).

The remainder of this paper is organized as follows. Section 2 will provide the 
configuration details for the networks we tested. In section 3, the actual hardware 
configuration of the testbed system will be discussed. Section 4 presents the 
results of the experiments. Finally, section 5 presents final conclusions along 
with directions for further research. 

2.0 Network Configuration 

Fast Ethernet [Iee95] is a ten times faster version of the original Ethernet 
standard. The increase of the bit rate to 100 million bits per second (Mbps) and 
modifications to the physical layer of the Ethernet standard are the only major 
changes. This has greatly helped manufacturers in bringing products to market 
quickly and also has created a large consumer market because of Ethernet's 
familiarity. As a result, the price of Fast Ethernet equipment has fallen 
dramatically since its introduction. A typical PCI Fast Ethernet adapter costs 
$50-$80, and hubs cost approximately $75 per port. In addition, because the 
most common physical layer for Fast Ethernet (i.e., 100BASE-TX) utilizes
inexpensive cabling technology, category 5 unshielded twisted pair (UTP), 
wiring costs are also very low. 


2. For the purposes of this paper, the 4x4 routed 2D torus tested will often simply be referred to as a torus. 

2.1 Connection Options

To build a Fast Ethernet network, machines must be attached using a hub, a
switch, or a “crossover” cable. In a hub, all systems share a single broadcast
network, so only one host can send at a time, while one or more hosts may receive.
When more than one host attempts to use the network at the same time, a
“collision” occurs. The systems then retry their messages using the “carrier sense
multiple access with collision detection” (CSMA/CD) algorithm with exponential
backoff. This mechanism for handling shared network access is common to all
Ethernet based systems. This means that in a hub connected system the maximum
bisection bandwidth is limited to 100 Mbps (12.5 MBytes/sec) regardless of the
number of nodes in the network, and is often lower when more than one host is
contending for access. While this is hardly adequate for a parallel system, we
performed measurements on this configuration to see how it would perform.
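
The retry behaviour described above can be illustrated with a minimal sketch of
truncated binary exponential backoff. The constants (a slot of 512 bit times,
a backoff exponent capped at 10, and a limit of 16 attempts) are taken from
the IEEE 802.3 standard rather than from this paper, and the function names
are hypothetical.

```python
import random

# Assumed constants from IEEE 802.3, not stated in the paper.
SLOT_TIME_US = 5.12    # 512 bit times at 100 Mbps
BACKOFF_LIMIT = 10     # the exponent stops growing after 10 collisions
MAX_ATTEMPTS = 16      # the frame is dropped after 16 failed attempts


def backoff_slots(collision_count, rng=random):
    """Pick a random wait, in slot times, after the n-th collision."""
    if not 1 <= collision_count <= MAX_ATTEMPTS:
        raise ValueError("collision_count must be between 1 and 16")
    exponent = min(collision_count, BACKOFF_LIMIT)
    return rng.randrange(2 ** exponent)   # uniform in [0, 2^k - 1]


def max_backoff_us(collision_count):
    """Longest possible wait in microseconds after the n-th collision."""
    exponent = min(collision_count, BACKOFF_LIMIT)
    return (2 ** exponent - 1) * SLOT_TIME_US
```

The waiting window doubles with each successive collision, which is one reason
throughput on a shared hub degrades as more hosts contend for the medium.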

To increase the bisection bandwidth of the system, one must increase the number
of simultaneous connections possible and “break” the Ethernet into multiple
segments. This can be done either with an Ethernet switch or by adding TCP/IP
routers. The advantage of Ethernet switching is that there still appears to be a
single Ethernet network, though it will now support multiple simultaneous senders
and receivers. In addition, some Ethernet switches allow nodes to operate in “full
duplex” mode where they simultaneously send and receive data. This is especially
useful for acknowledgment and flow control packets that must flow from a
receiver to a sender. The disadvantage, however, is that Ethernet switches are
expensive, $300-$700 per port, and they do not scale past 100-200 nodes. Further,
switches do have a limited bisection bandwidth, though they can typically deliver
1-2 Gbps of aggregate bandwidth.

A second choice, however, is to utilize TCP/IP based routing where either some or 
all nodes forward packets between subnets. This scheme increases the aggregate 
bandwidth of the network without purchasing additional switching hardware (the 
nodes are the switches). In addition, if nodes are attached directly using 
“crossover” cables rather than hubs, full duplex operation is possible. However, 
router nodes must have more than one Ethernet card, nodes must spend CPU time
forwarding packets between other nodes, and the performance of TCP/IP routing
is usually lower than that of Ethernet switches.

In this paper we chose to test both a hub connected system and a routed topology. 
The topology we chose, a 2D torus, requires all nodes to perform routing. Further,
because links are implemented with crossover cables (i.e., the network does not 
include any hubs), all connections can operate in full duplex mode. 

The 2D torus was chosen for two reasons. The first reason was scalability: a mesh
or torus network can be expanded to any size system by increasing one or both
dimensions. This is particularly important because the planned size for
Whitney is 400-500 nodes. In addition, increasing both dimensions increases not
only the size of the mesh but also the bisection bandwidth. The only
limitation is that as the size increases, so does the diameter of the network. We
chose to minimize this effect by keeping the mesh square and providing the
wraparound connections.

The second reason for choosing a 2D torus was physical and cost constraints.
The nodes we used in the experiments had only 5 PCI slots. Utilizing single port
Ethernet cards, this means that no more than 5 other systems may be attached to
each node. While there are 2 and 4 port Ethernet cards, the per port cost is 2-4
times the cost of single port cards. Because we wanted an arbitrarily scalable
network, we could not use a hypercube (we could only have up to 2^5 = 32 nodes),
and we would need 6 links for a 3D mesh/torus.
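
The link-count argument above can be checked with a few lines of arithmetic.
This is only an illustrative sketch; the function names are hypothetical, and
the only input taken from the text is the limit of 5 PCI slots per node.

```python
PCI_SLOTS = 5  # per-node slot count reported in the text


def hypercube_max_nodes(links_per_node):
    """A hypercube of dimension d has 2^d nodes and needs d links per node."""
    return 2 ** links_per_node


def hypercube_links_needed(nodes):
    """Smallest hypercube dimension that can hold the given node count."""
    return (nodes - 1).bit_length()


def torus_links_per_node(dimensions):
    """A torus needs two links (one in each direction) per dimension."""
    return 2 * dimensions


print(hypercube_max_nodes(PCI_SLOTS))    # largest hypercube with 5 links
print(hypercube_links_needed(400))       # links needed for 400 nodes
print(torus_links_per_node(2), torus_links_per_node(3))
```

With single port cards, a 2D torus fits within the 5 available slots at any
size, whereas a hypercube of the planned 400-500 nodes or a 3D torus would not.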

2.2 Torus Network 

Figure 1 illustrates the 2D torus configuration. Each of the sixteen nodes was 
directly linked to its four nearest-neighbors via a 100 Mbps bidirectional Fast
Ethernet connection. Thus, the torus was partitioned into thirty-two distinct 
TCP/IP subnetworks. 
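
The count of thirty-two subnetworks follows from the topology: each
point-to-point link is its own subnet. The sketch below enumerates the links,
assuming the nodes are numbered 1 to 16 in row-major order. That numbering is
an assumption, though it is consistent with the neighbors of node 1 listed in
Section 4.1 (nodes 2, 4, 5, and 13).

```python
ROWS, COLS = 4, 4  # 4x4 torus


def node_id(row, col):
    """Row-major node number starting at 1 (assumed numbering)."""
    return row * COLS + col + 1


def neighbors(node):
    """Return the four nearest neighbors, including wraparound links."""
    row, col = divmod(node - 1, COLS)
    return sorted({
        node_id(row, (col + 1) % COLS),
        node_id(row, (col - 1) % COLS),
        node_id((row + 1) % ROWS, col),
        node_id((row - 1) % ROWS, col),
    })


def links():
    """Return the set of undirected links, one per TCP/IP subnet."""
    return {frozenset((n, m))
            for n in range(1, ROWS * COLS + 1)
            for m in neighbors(n)}


print(neighbors(1))
print(len(links()))
```

Sixteen nodes with four links each, with every link shared by two nodes, gives
16 x 4 / 2 = 32 links.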



FIGURE 1. 16 node 2D torus network configuration 


Links between neighbor interfaces (for the torus configuration) used standard 
category 5 unshielded twisted pair wiring that was crossed over (null modem). 
The wiring was tested and certified for 100 Mbps operation to ensure good
connections. All links were direct so no dedicated hubs, routers, repeaters, 
switches, or other devices were used in the torus. 

In addition to the topology depicted in Figure 1, an additional node was
connected to a fifth network interface in node 1. Its major functions were to
serve as a front-end for starting jobs on the cluster and to work as an NFS server
for the processing nodes. Shared disk I/O, while important in a production
system, was not a significant factor in any of the benchmarks used in
this paper. The final Whitney system will have a parallel file system
implemented across multiple I/O nodes.

2.2.1 Hub 

For the hub experiments, all nodes were attached to two “stacked” Bay Networks
Netgear FE 516 Hubs. When stacked, the two 16 port hubs act like a single
32-port hub. Each node had only a single Ethernet card, and all nodes plus the
front end were on a single TCP/IP subnet.

3.0 The Whitney Prototype 

3.1 Hardware 

The Whitney prototype consisted of 30 nodes (though only 16 were used in these 
experiments) with the following hardware: 

• Intel Pentium Pro 200MHz/256K cache 

• ASUS P/I-P65UP5 motherboard, Natoma Chipset

• ASUS P6ND CPU board 

• 128 MB 60ns DRAM memory 

• 2.5 GB Western Digital AC2250 hard drive 

• 4 Cogent/Adaptec ANA-6911/TX Ethernet cards (see footnote 3)

• Trident ISA graphics card (used for diagnostic purposes only) 

For this paper, we chose to concentrate on a TCP/IP routed network of systems. 
In addition we also performed experiments where all nodes were attached to a 
single hub. Subsequent research will evaluate the cost/performance trade-offs of
Ethernet switching as well as hybrid network schemes.

3.2 Software 

Red Hat Linux 4.1 (RHL, see footnote 4) was installed on each of the processing
nodes. The kernel included with RHL, version 2.0.27, was replaced with the newest
version at the time, 2.0.30. The kernel was compiled with IP forwarding turned on
so that the routing mechanism of Linux could be used. Both the de4x5 v0.5 and the
tulip v0.76 Ethernet drivers were tested. The de4x5 driver was used initially and
exhibited some inconsistent performance characteristics. The final torus
configuration on which all benchmarks were run used the tulip development driver.

A script executed at boot time configured the Ethernet interfaces in each node.
Another program set up the routing tables on each node with static routes to
non-local subnets using an X-Y routing scheme. Packets addressed to non-neighbor
nodes were forwarded through the appropriate interface towards their
destination. The shortest-hop distance was maintained in all cases.


3. Node 1 contained an additional Ethernet card. The additional card was connected to the front-end node.

4. Red Hat Linux is available from http://www.redhat.com.
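
A minimal sketch of X-Y (dimension-order) routing on the 4x4 torus is shown
here for illustration. It is not the routing table program used on Whitney.
The row-major node numbering, the choice to correct the column before the row,
and the tie-breaking rule when both directions around a ring are equally short
are all assumptions.

```python
ROWS, COLS = 4, 4


def coords(node):
    """Map a node number (1..16, row-major, assumed) to (row, col)."""
    return divmod(node - 1, COLS)


def step_toward(src, dst, size):
    """Return -1, 0, or +1: the direction of the shorter way around a ring.

    Ties (distance exactly half the ring) are broken toward +1 (assumption).
    """
    if src == dst:
        return 0
    forward = (dst - src) % size
    return 1 if forward <= size - forward else -1


def next_hop(current, dest):
    """Correct the column (X) first, then the row (Y)."""
    row, col = coords(current)
    dest_row, dest_col = coords(dest)
    dx = step_toward(col, dest_col, COLS)
    if dx != 0:
        col = (col + dx) % COLS
    else:
        row = (row + step_toward(row, dest_row, ROWS)) % ROWS
    return row * COLS + col + 1


def route(src, dest):
    """Return the full list of nodes visited from src to dest."""
    path = [src]
    while path[-1] != dest:
        path.append(next_hop(path[-1], dest))
    return path


print(route(1, 11))   # farthest node from node 1
print(route(1, 16))   # uses wraparound links in both dimensions
```

Because each step moves one position closer along the shorter way around a
ring, the resulting path always has the minimum hop count.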

The MPI message passing system [Mes94] was used for communication between 
processors. MPICH (version 1.1.0) [GrL96] was the specific MPI implementation
utilized. It was built using the P4 device layer, so all communication was
performed on top of TCP sockets. Programs were started on the mesh by the 
mpirun program [Fin95] which resided on the front-end. mpirun takes the name 
of the program and the number of processing nodes to use and then remotely 
spawns the appropriate processes on the mesh. All of the benchmarks mentioned 
in this report used MPI for communication. 

4.0 Performance 

The first benchmark run on the torus measured the message latency and
bandwidth of point to point links. It was useful for evaluating the performance
degradation of the different route distances in the torus. The second benchmark
measured the performance of collective communication. Finally, the NAS Parallel
Benchmarks version 2.2 were run. These are a set of benchmarks that
approximate the MPI performance of a parallel architecture on “real world” tasks
(i.e., CFD codes).

4.1 Point to point message passing 

To measure point-to-point message passing performance, an MPI ping-pong
benchmark was utilized. This benchmark simply sent a message of a fixed size
from one node to another and then back. The time for this operation was then
divided by two to get the time to send a message one way. The message size was
varied from 1 byte to 1 Mbyte, and all experiments were repeated 20 times. Figure 2
illustrates the point to point message send/receive time from node 1 to each of
the other nodes in the torus configuration. As can be seen from this graph, the
message passing performance separates into 4 categories. These 4
categories represent the number of hops each node is from node 1. The
lowest transmission time is from node 1 to its adjacent neighbors, 2, 4, 5, and 13.
The second category is nodes that must be reached through node 1’s
neighbors (they are 2 hops away), i.e., 3, 6, 8, 9, 14, and 16. The third category
is nodes 3 hops away, i.e., 7, 10, 12, and 15, and the final category is nodes 4
hops away, of which there is only one, node 11. Similar performance curves can
be generated for any other node pair, with similar results based on the nodes’
distance.
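
The four hop categories listed above can be reproduced from the torus
geometry. The sketch below assumes row-major numbering of the nodes from 1 to
16, which is an assumption, although it matches the neighbor lists given in the
text; the function names are hypothetical.

```python
from collections import defaultdict

SIZE = 4  # the torus is SIZE x SIZE


def ring_distance(a, b, size=SIZE):
    """Shortest distance between two positions on a ring with wraparound."""
    d = abs(a - b)
    return min(d, size - d)


def hops(src, dst):
    """Minimum hop count between two nodes of the torus."""
    r1, c1 = divmod(src - 1, SIZE)
    r2, c2 = divmod(dst - 1, SIZE)
    return ring_distance(r1, r2) + ring_distance(c1, c2)


def group_by_hops(src):
    """Group every other node by its hop distance from src."""
    groups = defaultdict(list)
    for node in range(1, SIZE * SIZE + 1):
        if node != src:
            groups[hops(src, node)].append(node)
    return dict(groups)


for distance, nodes in sorted(group_by_hops(1).items()):
    print(distance, nodes)
```

By symmetry of the torus, every node sees the same distribution: 4 nodes at one
hop, 6 at two hops, 4 at three hops, and 1 at four hops.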

Figure 3 depicts the message passing time for the hub configuration. Here only a 
single line is shown because all nodes are of equal distance. Therefore, the 
performance is similar to the nearest neighbors in Figure 2. 

FIGURE 2. Point to point torus message passing time from node 1 to N (plot: Transmission Time vs. Message Size for a Torus)

FIGURE 3. Point to point message passing performance on a hub (plot: Transmission Time vs. Message Size for a Hub)

4.1.1 Latency

To determine the latency of message passing, Figures 4 and 5 depict the message
passing time for small messages. As these graphs show, the latency for a
single hop on the torus or for the hub is about 175 µsecs. Each additional hop on
the torus adds about 40 µsecs, so the latency is 215 µsecs for 2 hops, 255 µsecs
for 3 hops, and 295 µsecs for 4 hops.

FIGURE 4. Transmission time for small messages on a torus (plot: Latency on a Torus)

FIGURE 5. Transmission time for small messages on a hub (plot: Latency on a Hub)
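
The latency figures reported in this section fit a simple linear model, sketched
here. The two constants are the approximate values read from the text; the
function name is hypothetical, and the model is not claimed to hold beyond the
four hop distances that were measured.

```python
BASE_LATENCY_US = 175   # one hop on the torus (or the hub), from the text
PER_HOP_US = 40         # approximate cost of each additional hop


def torus_latency_us(hops):
    """Approximate small-message latency in microseconds for a hop count."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    return BASE_LATENCY_US + PER_HOP_US * (hops - 1)


print([torus_latency_us(h) for h in range(1, 5)])
```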

4.1.2 Bandwidth 


MPI bandwidth vs. message size is shown in Figures 6 and 7. As can be seen
from these graphs, Ethernet bandwidth is quite erratic. However, some patterns
can be seen. As expected, bandwidth for small message sizes is low, building to a
sustained bandwidth of approximately 8-8.5 MB/sec for one hop on the torus or
on the hub. For nodes more than one hop away on the torus, the bandwidth drops
about 1.5 MB/sec per hop (8.5 MB/sec, 7 MB/sec, 5.5 MB/sec, 4 MB/sec). Also,
note that the bandwidth reaches peak performance at an 8K message size, then it
drops down and starts to build to peak slowly as message size approaches 1 MB.
This anomaly is likely due to either the Ethernet or TCP packet size.

FIGURE 6. Bandwidth performance of the torus topology (plot: Bandwidth vs. Message Size for a Torus)
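
The per-hop bandwidth figures quoted in this section can be summarized in the
same way as the latency. This sketch uses the approximate values from the text
(8.5 MB/sec for one hop, dropping about 1.5 MB/sec per additional hop) and
compares them with the 12.5 MBytes/sec raw rate of a 100 Mbps link; the
function names are hypothetical.

```python
LINE_RATE_MB_S = 12.5    # 100 Mbps expressed in MBytes/sec
ONE_HOP_MB_S = 8.5       # sustained bandwidth for one hop, from the text
DROP_PER_HOP_MB_S = 1.5  # approximate loss per additional hop


def torus_bandwidth_mb_s(hops):
    """Approximate sustained point-to-point bandwidth for a hop count."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    return ONE_HOP_MB_S - DROP_PER_HOP_MB_S * (hops - 1)


def line_rate_fraction(hops):
    """Fraction of the raw link rate delivered to the application."""
    return torus_bandwidth_mb_s(hops) / LINE_RATE_MB_S


for h in range(1, 5):
    print(h, torus_bandwidth_mb_s(h), round(line_rate_fraction(h), 2))
```

On these numbers a single hop delivers roughly two thirds of the raw link rate,
and a four hop route delivers roughly one third.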

4.2 Collective Communication 

To measure the performance of collective communication, an MPI broadcast
benchmark was utilized. The benchmark measured the time required to
broadcast a message to a given set of nodes and perform an MPI barrier
synchronization. Message sizes used for the broadcast were varied between 1 and
32768 bytes in 2^n steps. Each message size was broadcast 20 times.

FIGURE 7. Bandwidth performance on a hub (plot: Bandwidth vs. Message Size for a Hub)


4.2.1 Bandwidth 

The aggregate bandwidth of the torus for collective communication is depicted
in Figure 8. Our experiments show that for message sizes below 1024 bytes the
aggregate bandwidth is very poor. Both the Ethernet frame size and the TCP
packet size could be possible causes for this. Above the 1024 byte threshold,
performance becomes much closer to expected levels. The maximum aggregate
bandwidth was observed to be about 43 MB/s for the 4096 byte message size.
While the theoretical maximum aggregate bandwidth of the torus should be 400
MB/s, this does almost reach the maximum bisection bandwidth (50 MB/s).
Further, it is quite good given the cost of software routing, processor overhead,
TCP/IP overhead, etc. In general, the aggregate bandwidth increases as the
number of nodes increases for a given message size. The exceptions are probably due
to inconsistencies in routing latency and network contention.
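
The figures in this paragraph can be related with a short calculation. The
400 MB/s value appears to correspond to the 32 links of the torus each carrying
12.5 MBytes/sec; that reading is an assumption, as the paper does not show the
derivation. The bisection value of 50 MB/s is taken directly from the text, and
the variable names are hypothetical.

```python
LINK_MB_S = 100 / 8      # 100 Mbps link expressed in MBytes/sec
TORUS_LINKS = 32         # one link per subnet in the 4x4 torus
BISECTION_MB_S = 50      # maximum bisection bandwidth stated in the text
OBSERVED_MB_S = 43       # best aggregate broadcast bandwidth observed

theoretical_mb_s = TORUS_LINKS * LINK_MB_S


def fraction(observed, limit):
    """Observed bandwidth as a fraction of a given limit."""
    return observed / limit


print(theoretical_mb_s)
print(round(fraction(OBSERVED_MB_S, theoretical_mb_s), 3))
print(round(fraction(OBSERVED_MB_S, BISECTION_MB_S), 2))
```

The observed peak is about 11% of the theoretical aggregate but about 86% of
the stated bisection bandwidth.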

The collective communication bandwidth for the hub is shown in Figure 9. The
cut-off for good performance is still at about 1024 bytes; however, performance
isn’t as poor below this size as was seen in the torus. Aggregate bandwidth
increases regularly as the number of nodes increases for messages up to 512
bytes. As can be seen, performance is very irregular for larger message sizes and
the maximum aggregate bandwidth is about 9 MB/s. Both the irregular
performance and the low maximum bandwidth are probably a result of collisions
on the shared 100 Mb/s network.

FIGURE 8. Collective communication performance on a torus (plot: Broadcast Bandwidth vs. Number of Nodes for a Torus)

FIGURE 9. Collective communication performance on a hub

4.2.2 Barrier Synchronization Time

A comparison between the hub and torus barrier synchronization time is shown
in Figure 10. Clearly, the torus provides significantly faster barrier
synchronization than the hub as the number of processors increases. Also, the hub
performance is much more inconsistent. This may provide an explanation for the
hub’s sporadic aggregate bandwidth performance (Figure 9).

FIGURE 10. Comparison of barrier synchronization time for the hub and torus (plot: Barrier Sync Time vs. Number of Processors)

4.3 The NAS Parallel Benchmarks 

Computational Fluid Dynamics (CFD) is one of the primary fields of research
that has driven modern supercomputers. This technique is used for aerodynamic
simulation, weather modeling, and other applications where it is necessary
to model fluid flows. CFD applications involve the numerical solution of
non-linear partial differential equations in two or three spatial dimensions. The
differential equations representing the physical laws governing fluids in
motion are referred to as the Navier-Stokes equations. The NAS Parallel
Benchmarks [BaB91] consist of a set of five kernels, less complex problems intended
to highlight specific areas of machine performance, and three application
benchmarks. The application benchmarks are iterative partial differential equation
solvers that are typical of CFD codes.

In this section, we show results for the NPB 2.2 codes [BaH95], which are MPI
implementations of the NAS Parallel Benchmarks. The NPB 2.2 benchmark set
includes codes for the three application benchmarks, BT, SP, and LU. It also
includes code for 4 of the five original kernel benchmarks, EP, FT, MG, and IS (it
does not include CG). Full results for these codes are shown in the appendix of
this paper. Benchmarks were compiled with the Portland Group’s Fortran 77
compiler, pgf77, using the options: -O -Knoieee -Munroll -Mdalign
-tp p6. These benchmarks were run for all valid sizes that would fit on the
available nodes. This included 1, 2, 4, 8, and 16 processors for LU, FT, MG, and
IS because they required processor counts that were a power of two. BT and SP,
however, required sizes that were perfect squares, so they were run for 1, 4, 9,
and 16 processors. Note that in the appendix single processor times are only
shown for the hub, though they should be the same for the torus since the
network is not used. In addition, we measured performance of the torus both
for 4 processors in a row (nodes 1, 2, 3, and 4 of Figure 1) and for a 2x2 layout
(i.e., nodes 1, 2, 5, and 6 of Figure 1). This made a minor difference in
performance; however, it may be important on larger systems.

FIGURE 11. Comparison between hub and torus topologies for BT benchmark (plot: BT Benchmark - Total MFLOPS)

FIGURE 12. Comparison between hub and torus topologies for SP benchmark (plot: SP Benchmark - Total MFLOPS)

FIGURE 13. Comparison between hub and torus topologies for LU benchmark (plot: LU Benchmark - Total MFLOPS)
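
The processor counts used for each benchmark follow directly from the two
constraints described above. As a small illustration (the function names are
hypothetical), the valid counts on a 16 node system can be enumerated as:

```python
import math

AVAILABLE_NODES = 16


def powers_of_two(limit):
    """Processor counts valid for LU, FT, MG, and IS."""
    return [2 ** k for k in range(limit.bit_length()) if 2 ** k <= limit]


def perfect_squares(limit):
    """Processor counts valid for BT and SP."""
    return [n * n for n in range(1, math.isqrt(limit) + 1)]


print(powers_of_two(AVAILABLE_NODES))
print(perfect_squares(AVAILABLE_NODES))
```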

In Figures 11, 12, and 13 the performance of the three NAS application 
benchmarks on a hub and torus is compared. The first thing you will notice from 
the graphs is that in many cases the hub does not perform as poorly as one might 
expect, particularly for the Class A benchmarks. In most cases the performance 
of the torus was better than the hub. In the few cases where the hub was better, 
the difference was negligible. Also, as expected, differences between the hub and 
mesh increase as the number of processors increases, due to contention on the 
hub. 

Of the application benchmarks, LU has the highest performance, 328 MFLOPS 
for a 16 processor hub and 402 MFLOPS for a 16 processor torus. This result is 
typical of the measurements we have made on Ethernet networks, i.e., LU’s 
network characteristics seem to match nicely with Ethernet. BT also performs 
well, 282 MFLOPS for the hub and 323 MFLOPS for the torus, though it is 
significantly slower than LU. SP performs the worst, with less than half of the 
performance of LU. This would indicate that while some algorithms do match 
well to the performance characteristics of Ethernet, others perform significantly
worse. 
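
To put the 16 processor MFLOPS figures quoted above in relative terms, the
improvement of the torus over the hub can be computed as follows. Only the LU
and BT values are given in the text, so SP is omitted; the names are
hypothetical.

```python
# (hub MFLOPS, torus MFLOPS) on 16 processors, as quoted in the text
RESULTS = {
    "LU": (328, 402),
    "BT": (282, 323),
}


def torus_gain_percent(hub, torus):
    """Percentage improvement of the torus result over the hub result."""
    return 100 * (torus - hub) / hub


for name, (hub, torus) in RESULTS.items():
    print(name, round(torus_gain_percent(hub, torus), 1))
```

On these figures the torus is roughly 23% faster than the hub for LU and
roughly 15% faster for BT.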

Similar differences can be seen in the kernel benchmark results as well. IS 
performs particularly poorly on Ethernet. EP also performs poorly in an absolute 
sense, but scales well, so its performance loss is likely due to the compilers or 
libraries available on our system. FT and MG perform reasonably, but still suffer 
from communication costs, especially on the hub network. 

5.0 Conclusion 

Our experimental results have shown that in the Whitney testbed cluster, a Fast
Ethernet torus exhibits more desirable performance characteristics than a Fast
Ethernet hub network. Collective communication tests show that for a 16
processor system, aggregate bandwidth is more than 4 times higher on a torus than
a hub. Furthermore, many of the NPB2 results for the torus showed significant
performance increases over the hub as the problem size and the number of
processors were scaled. No NPB2 result showed more than a negligible
performance advantage for using the hub.

There is also strong evidence that a torus will scale much more regularly than a 
hub network for larger processor numbers. Our results have shown that 
collective communication bandwidth is very sporadic on the hub relative to the 
torus performance. Also, the MPI barrier synchronization time for the torus was
shown to scale more regularly and be much less than in a hub topology. These 
results are evidence of the hub’s shared 100 Mb/s bandwidth becoming 
overloaded. The fine-grained segmentation of the torus largely prevents this 
problem. 

In conclusion, Fast Ethernet configured in a torus topology has been shown to 
have better performance and to scale better than a hub based network. Future 
studies are planned to evaluate other network technologies and topologies in the 
Whitney testbed cluster. Fast Ethernet switching, for example, is interesting 
because it eliminates the processing load of software routing, provides a high 
degree of network segmentation, and has the ease of use of a hub. Myrinet is 
another promising network technology planned for evaluation. This paper 
provides a useful foundation on which to make these future comparisons. 

Appendix: NAS Parallel Benchmark Results